(54) SPEECH RECOGNITION APPARATUS

(71) We, Standard Telephones and Cables Limited, a British Company, of
STC House, 190 Strand, London, W.C.2, England, do hereby declare the
invention, for which we pray that a patent may be granted to us, and
the method by which it is to be performed, to be particularly described
in and by the following statement:

This invention relates to speech recognition apparatus and is
particularly applicable to man/machine communication interfaces
required in, for example, the computer industry.

The nature of speech is such that it lends itself to treatment in terms
of binary features, at least in classical analysis. Difficulties arise
in finding acoustic correlates of the classical distinctive features,
or in defining any set of acoustic features which is sufficient for
recognition. Even defining what is meant by 'sufficient' is not solved
in any real sense. Generally speaking, such sets of binary features as
have been defined are, moreover, far from statistically independent.

In order to find out if a set of features is sufficient for
recognition, the most practical course for speech, with its high
information content and considerable variability, is to adopt an
empirical, statistical approach, and determine error rates. No
recognition scheme will ever be perfect, because a real input can never
be sufficiently precisely defined. Therefore, one cannot show that a
recognition scheme will not work simply by producing an example of
speech where it fails. A recognition scheme works if it performs up to
some acceptable standard based on the statistics of its performance.

In the case of speech recognition certain basic elements are generally
accepted as necessary. These are a pre-processor, which converts the
acoustic signal into some form of data; a processor, which selects and
transforms the data into a form suitable for decision; and a
classification process, which is given a data pattern from the
processor and classifies it, correctly or incorrectly, or rejects it.
The aim may be to maximise the number of correct classifications, or
minimise the number of incorrect classifications.

Since the most practical way to evaluate features is to test them in a
recognition system, it is not unreasonable to select a really good
classification procedure (one that is optimum, simple, and well
understood being ideal) and find out what its input requirements are.
The processing sections are then defined in terms of input (acoustic
signal), output (required input for the decision process), and purpose
(features, relevant to recognition, requiring to be detected). In view
of the supposed 'binary opposition' basis of speech perception, and the
known optimality of the Maximum Likelihood Strategy (MLS) which can be
realised for binary feature spaces, it (the MLS) is a prime candidate
for the decision classification process.

The maximum likelihood decision is a guaranteed optimum procedure, but
is only solved for rather restricted cases:

(i) where the probability distribution is in terms of a binary space of
independent features.

(ii) where the probability distribution is Gaussian, with equal
co-variance matrices.
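
Case (i) can be illustrated with a short Python sketch. It is not taken from the specification: the function names, the two-word vocabulary and the probability values are assumptions chosen for illustration. With independent binary features, the likelihood of a pattern is the product of per-feature probabilities, so the maximum likelihood decision reduces to picking the class with the largest sum of log-probabilities.

```python
import math

def log_likelihood(bits, probs):
    """Log-probability of a binary pattern, given P(bit = 1) for each
    feature and assuming the features are statistically independent.
    Each probability must lie strictly between 0 and 1."""
    total = 0.0
    for bit, p in zip(bits, probs):
        total += math.log(p if bit else 1.0 - p)
    return total

def classify(bits, models):
    """Return the class under which the observed pattern is most likely."""
    return max(models, key=lambda name: log_likelihood(bits, models[name]))

# Hypothetical per-feature probabilities for two words (illustrative only).
models = {
    "one": [0.9, 0.1, 0.8],
    "two": [0.2, 0.7, 0.3],
}
print(classify([1, 0, 1], models))
```

In practice the probabilities would be estimated from error statistics gathered on real speech, in line with the empirical approach described earlier.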

According to the invention there is provided a speech recognition
apparatus including means responsive to selected acoustic
characteristics for decomposing a signal representing an acoustic input
into analogue signals on parallel channels, each analogue signal being
representative of a different acoustic feature of the input, means for
transforming the analogue signals into binary signals on parallel
channels, the binary signals constituting time ordered event markers
relating to the occurrence or occurrences of the respective acoustic
features represented by the analogue signals, means for generating
further binary signals being time ordered event markers marking the
occurrence or occurrences of specified sequences of event markers
relating to the occurrence of two or more different features in
specified sequences, and means for storing in a fixed predetermined
sequence binary information representing both the content of the
acoustic input in terms of the different individual acoustic features
of the input and the content of the acoustic input in terms of the
specified sequences of acoustic features.

In a preferred embodiment of the invention the apparatus further
includes means for determining the likelihood ratio of occurrence to
non-occurrence of the constituents of the stored binary information in
comparison with a reference pattern and means responsive to said ratio
whereby a decision is made for accepting, rejecting or requesting a
repeat of the acoustic input.
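
The three-way decision of the preferred embodiment might be sketched as follows. This is an illustration under stated assumptions, not the circuit of the specification: the function names, the probability values and the two thresholds (`accept_at`, `reject_at`) are hypothetical, and the ratio is taken in logarithmic form for convenience.

```python
import math

def log_likelihood_ratio(bits, p_given_word, p_given_not_word):
    """Sum, over independent binary features, of the log ratio of the
    probability of the observed bit if the word occurred to its
    probability if the word did not occur."""
    llr = 0.0
    for bit, p1, p0 in zip(bits, p_given_word, p_given_not_word):
        if bit:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
    return llr

def decide(llr, accept_at=2.0, reject_at=-2.0):
    """Accept, reject, or ask for a repeat when the evidence is weak.
    The threshold values are assumed, not taken from the source."""
    if llr >= accept_at:
        return "accept"
    if llr <= reject_at:
        return "reject"
    return "repeat"

# Hypothetical reference statistics for one word (illustrative only).
p_word = [0.9, 0.8, 0.1]
p_other = [0.2, 0.3, 0.6]
print(decide(log_likelihood_ratio([1, 0, 0], p_word, p_other)))
```

Widening the gap between the two thresholds trades more repeat requests for fewer wrong acceptances and rejections.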

There are two problems. The features must be statistically independent,
and they need to be presented as a set of binary observations, rather
than the time-varying set of signals produced by acoustic analysers.

The latter problem appears under many guises, the common ones in speech
being the 'segmentation' problem or the 'time-normalisation' problem.
There are actually two varieties of time-dependent information in
speech, a fact not commonly given explicit recognition. One type of
information concerns the duration of events, and the other type
concerns the order of events. While not wishing to make any claim that
there is only one way of dealing with this type of duality, it is
suggested that it clarifies one's thinking, and allows the core of the
problems to be recognised more easily, if these two types of time
information are thought of and handled separately. The further
suggestion is made that the handling of necessary duration information
is essentially part of the acoustic analysis. The output of the
acoustic analyser then consists of a set of data lines which carry data
in the form of standard pulses. Each channel can, for example, be
derived from a circuit responsive to some particular characteristic of
speech determined from the acoustic analysis, and depending on duration
and/or frequency cues. For a simple analysis scheme, examples might be
high-frequency-energy present for more than time T_a but less than T_b;
high-frequency-energy present for more than time T_b;
no-significant-feature present for more than time T_c but less than
time T_b; no-significant-feature present for more than time T_b. There
is some evidence to suggest that this type of duration analysis,
coupled with signals derived from four octave frequency bands, is
sufficient to recognise the digits (Ross, P. W. 1967, limited
vocabulary adaptive speech recognition system, presented at 23rd
Convention of the Audio Engineering Soc. 16-19 October, 1967, for
example). Two points should be noted. The first is the ternary manner
of handling the information, resulting from a threshold (below which
the event is ignored) and binary division of the noticeable event.



Such a division is intuitively reasonable for similar reasons to those
underlying the ternary proposal for handling spectral slope features
(i.e. positive slope, negative slope, no significant slope), and
amplitude features. The concept can also be applied to transition rates
for formants, rates of rise and fall of the mean power envelope and the
rate of change of slope. In some cases there is a division of a
noticeable event into two magnitude categories, in other cases there is
a division into two sign categories. Clearly in the second case it can
also prove profitable to consider each sign category as two magnitude
categories, if the magnitude has any significance. The second point to
notice is that we have been talking about the acoustic analyser, and
that the output consists of binary signal carrying lines, which in this
illustration consist of related pairs. The first line would signal when
the input signal terminated at T_x < T_b and the second when the input
signal terminated at T_x > T_b. In another embodiment three output
lines are provided, the third line carrying a signal saying that T_a
has been exceeded. If a 'dead' region were allowed then both lines
would be on together when there was doubt as to which had occurred.
Thus such methods convey duration and occurrence information about a
given feature in an economical and usable way: i.e. it occurred,
starting now; it was short, ending now; it was long, ending now.

The above mentioned and other features of the invention and the manner
of attaining them will become more apparent and the invention itself
will be better understood by reference to the following description of
an embodiment of the invention, taken in conjunction with the
accompanying drawings, in which:

Fig. 1 is a block schematic of the major sections of an automatic
speech recognition apparatus;

Fig. 2 illustrates the effect of time and level hysteresis in the
determination of primitive acoustic features in speech;

Fig. 3 illustrates a set of primitive acoustic events for a word, the
compound acoustic events derived from them and a bit pattern associated
with the word;

Fig. 4 is a schematic of the significant constituents of one form of
acoustic analyser for Fig. 1;

Fig. 5 is a schematic of the significant constituents of a feature
time-continuity filter with delay normalisation used in the arrangement
of Fig. 4;

Fig. 6 is a schematic of the significant constituents of a ternary
event detector used in the arrangement of Fig. 4;

Fig. 7 illustrates the operation of a ternary event detector for the
three possible cases of input-pulse duration;

Fig. 8 is a schematic of one form of control circuit for Fig. 1;

Fig. 9 is a schematic of the significant constituents of one form of
sequence detector used in the arrangement of Fig. 1;

Fig. 10 is a schematic of the significant constituents of an elementary
sequence event generator suitable for use in the arrangement of Fig. 4,
and

Fig. 11 is a schematic of the significant constituents of the decision
logic used in the arrangement of Fig. 1.

In the arrangement shown in Fig. 1 the speech is fed to an acoustic
analyser 100. A set of filters or other detection elements PAC1, PAC2
.... PACn decompose the speech into Primitive Acoustic Characteristics
(PAC). Each PAC is reduced to a Primitive Acoustic Feature (PAF) by
corresponding threshold devices PAF1 .... PAFn. These threshold devices
each have a certain degree of hysteresis so that once a decision has
been made regarding the PAF this decision is adhered to until there is
good reason to change the decision. Such a decision represents the
formation of a minimum 'null hypothesis' consistent with the incoming
evidence and may consist of 'the feature is occurring' or 'the feature
is not occurring'. The hypothesis is not abandoned until it is
inconsistent with more recent evidence, rather than merely inadequately
supported. Without the ability to make and stick to such minimum
hypotheses, the machine's ability to structure its input, an essential
preliminary to making good decisions, is seriously handicapped. In
physical terms the effect on signals is as illustrated in Figs. 2A to
2H. At some level of evidence it is necessary to say a feature is
present, and then stick to this decision until the evidence very
definitely shows that the feature is absent. Consider an analogue
signal containing a PAC the duration of which, in real time, is from
T_1 to T_2 (Fig. 2A), this signal, of course, being unavailable
directly. Two trigger levels tl_1 and tl_2 are indicated (Fig. 2B) in
relation to the analogue signal. The output signal when the lower
trigger level tl_1 is considered without any hysteresis is one
indicating the apparent occurrence of several PAF's of varying
durations (Fig. 2C). This output is misleading as, due to the nature of
speech, there is for all practical purposes only one PAF of duration
T_1 to T_2. Incorporating time hysteresis in the circuit has the effect
of eliminating some of the insignificant variations in the input
signal. The hysteresis introduces a time delay r where r is the time
for which the signal must be continuously in one particular state for
that state to occur as an output. It will be noted that a spurious
pulse late in time, because of the hysteresis, incorrectly extends the
output (Fig. 2D).

If the trigger level is raised to tl_2 the effect of the same degree of
hysteresis is to eliminate not only the spurious responses but also the
correct responses (Fig. 2E). If a time hysteresis r is introduced the
analogue signal is never of sufficient amplitude for long enough to
give a significant output (Fig. 2F). Combining the two trigger levels
without time hysteresis, with above tl_2 being 'on' and below tl_1
being 'off', results in a form of amplitude hysteresis which eliminates
the lesser of the insignificant fluctuations in the output (Fig. 2G).
Introducing a time hysteresis r in the combined output eliminates the
greater insignificant fluctuations, resulting in a proper recognition
of the PAF with only a time delay of r (Fig. 2H). It will be noted that
both amplitude and time are involved in the hysteresis. This is
necessary, at the practical level, first in order to make a reasonable
representation of the input, and secondly to produce an output signal
suitable for subsequent processing.
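
The combined effect of amplitude and time hysteresis can be sketched in Python on a sampled signal. This is a software analogy under assumptions, not the threshold circuit itself: the function names are hypothetical, `low` and `high` stand in for the trigger levels tl_1 and tl_2, and `hold` stands in for the delay r expressed as a number of samples.

```python
def amplitude_hysteresis(samples, low, high):
    """Switch 'on' when the signal rises above `high` and 'off' when it
    falls below `low`; between the two levels keep the previous state."""
    state, out = False, []
    for x in samples:
        if x > high:
            state = True
        elif x < low:
            state = False
        out.append(state)
    return out

def time_hysteresis(states, hold):
    """Change the output only after the input has stayed in the new
    state for `hold` consecutive samples (the delay r of the text)."""
    out_state, run, out = False, 0, []
    for s in states:
        if s != out_state:
            run += 1
            if run >= hold:
                out_state, run = s, 0
        else:
            run = 0
        out.append(out_state)
    return out

# Illustrative signal with a brief dropout in the middle of one feature.
samples = [0, 3, 6, 4, 6, 1, 6, 5, 0, 0, 0]
raw = amplitude_hysteresis(samples, low=2, high=5)
paf = time_hysteresis(raw, hold=2)
print(paf)
```

In this example the one-sample dropout survives the amplitude stage but is removed by the time stage, so the output shows a single feature delayed by `hold` samples, which is the behaviour described for Fig. 2H.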

The outputs of PAF1 to PAFn consist of signals indicating when certain
important features of the speech signal are present and when they are
absent. The content is specified by which lines are active, but the
order is still implicit in their order of output. The order is
difficult to specify because the signals overlap. Before any detection
of sequential characteristics is carried out, therefore, it is
necessary to carry out a little more processing, namely to change an
extended PAF into two events, 'primitive acoustic events' (PAE's),
which are standard pulses marking the time when a decision is taken
that the feature is present, and the time when it is decided that the
feature is absent. There is a snag. In doing this we are, rightly,
sorting into signals which can be ordered meaningfully, but at the same
time we are consigning information about absolute duration of the PAF's
to mere implication in terms of the order of events. This may be
undesirable simply because the first event after a feature has begun
may be that the feature has ended, and if the duration of such a
feature is significant, then we have lost it, for at this stage we are
interested in processing for order. The trouble arises because a
distinction by absolute duration, such as that between a stop release
(say for /t/) and a fricative (say /s/), depends on content and not
order. Thus event detection must also take account of absolute
duration, and in that way complete the extraction of content. An event
detector will therefore have one input, a PAF, and N + 1 outputs, one
marking the beginning of the PAF, and the others marking its end in
each of N duration categories. Evidence suggests that N = 2 is usual
for English. In any case, if the duration of a PAF is ambiguous (it
ends just on the boundary, or within half a standard pulse width) then,
to avoid losing information, and perhaps for other reasons as well, the
occurrence of both the possible events should be indicated. Thus, in
continuous speech, a machine might need to consider a silence of
ambiguous duration as either a stop-gap or an end-of-phrase gap. At
this stage we have reduced the original input to a set of primitives,
PAFs. Resolution of the order is determined by the width of the
standard pulse representing each such event. If two events overlap, we
cannot assign precedence between the pair concerned. However we may
further process the information to extract significant aspects of the
sequence of events in terms of structural descriptors called 'compound
acoustic events' (CAE's) using a grammar based method.
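
The event detector with N + 1 outputs described above can be sketched for N = 2 as follows. This is an illustrative model on sampled data, not the hardware detector: the function name, the event labels, and the parameters `boundary` and `half_window` (the duration boundary and the half-width of the ambiguous region) are assumptions.

```python
def detect_events(paf, boundary, half_window):
    """Turn a sampled PAF (truthy while the feature is present) into a
    list of (time, event) markers. Events are 'start', 'end_short' and
    'end_long'. A feature whose duration lies within `half_window` of
    `boundary` is ambiguous and produces both end markers."""
    events, start = [], None
    # A trailing 'off' sample closes a feature still on at the end.
    for t, on in enumerate(list(paf) + [False]):
        if on and start is None:
            start = t
            events.append((t, "start"))
        elif not on and start is not None:
            duration = t - start
            if duration <= boundary + half_window:
                events.append((t, "end_short"))
            if duration >= boundary - half_window:
                events.append((t, "end_long"))
            start = None
    return events

# A short, an ambiguous and a long feature (durations 2, 5 and 9 samples).
paf = [0] + [1] * 2 + [0] * 2 + [1] * 5 + [0] + [1] * 9 + [0]
print(detect_events(paf, boundary=5, half_window=1))
```

Emitting both end markers for the ambiguous case defers the choice to later stages, which matches the stated aim of not losing information at the boundary.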

The determination of the order information is performed in the sequence
detector 200. The PAE's are selectively applied to Elementary Sequence
Elements ESE1, ESE2 .... ESEn. Each has two primary inputs, for the
events whose precedence is to be computed, and a third input into which
prohibited events ('sequence breakers') are OR-ed together. One such
element corresponds to one level of recursion of the equivalent
computer analysing procedure. The equivalent function in a computer
simulation is, of course, carried out by a single recursive subroutine.
The grammar is that of a descriptive language for [illegible] (in this
case, words, which are audi[illegible]). The language must describe,
for example, pertinent aspects of the ordering of the primitives. The
necessary pattern description language may be simple, which would
compensate for the relative complexity of the primitives required for
speech. There is an overwhelming gain in ope[illegible] when the
specification of sub-patterns, their relations, etc., in terms of which
the pattern is to be analysed, is separated from the mechanism which
does the analysis, for the specification may easily be changed.
Sub-patterns, which we may call 'compound acoustic events', are defined
in terms of sub-patterns and/or primitives only, which is the same as
saying CAE's are defined in terms of CAE's and/or PAE's only. Let us
call both types of event simply 'events', where it is not confusing.
There is only one relationship function, that of precedence. Thus
sub-patterns or CAE's may be defined recursively without specifying
property functions. The grammar is, however, context sensitive. It is
necessary to specify, at each level of the recursions, sub-lists of
other objects which must not bear a prohibited relationship to the
object defined at the level involved. In less abstract terms, this
amounts to a statement that one can specify that certain other events
must not intervene between the two events whose precedence function is
being evaluated. A chief advantage of recursive definition is that the
structure of sub-patterns, or CAE's, is specified by their name. Thus,
considering a computer simulation, the CAE (((A(BC))F)B) would be
decomposed by the analyser programme to a head ((A(BC))F) and a tail B.
The first sub-list of prohibited objects would tell the analyser which
events were not allowed to intervene between the head and tail at this
level. The tail is a primitive, and therefore is available as a set of
PAE pulses marking the times at which the event B occurred. If the head
were also available, then the analyser could establish the times at
which B occurred immediately after the head event, discounting all
intervening events except those prohibited. If the head were not
available (from previous determination) then it would be treated as a
new event to be recognised and the process repeated. Eventually a level
of recursion in the simulation would be reached at which only
primitives or previously recognised events occupied the tops of the
head and tail stacks that had built up and the procedure could unwind,
generating sets of event pulses corresponding to the times of the
various events on the head and tail lists, until the time marker pulses
for the event originally specified were generated. Such a
grammar-controlled analyser has been simulated on a computer for speech
recognition studies. For considering a hardware embodiment the
specification of these time markers is analogous to the identification
of picture points associated with a pattern or sub-pattern for a
picture. The final ESE outputs are entered as binary information
B_1 .... B_y in a Bit Pattern Register 300. The output of the sequence
detector may be in several forms. (Note that the sub-patterns are
synonymous with events as indicated above.) For example:

(1) A bit pattern, each bit corresponding to a particular CAE or PAE,
and set to '1' if the event in question was detected.

(2) A bit pattern representing a set of 'barometer' type counts
(count = number of bits set), each count representing the number of
times a given event occurred.

(3) A varying bit pattern, held in monostables whose period would be
adjusted to the length of the longest period for which a given event
might be significant.
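
The recursive head-and-tail analysis and output form (1) can be sketched in Python. This is only an illustration of the idea, not the simulated analyser referred to in the text. The assumptions are: a CAE is written as a nested `(head, tail)` tuple rather than parsed from its name, primitives are given as lists of pulse times, and a single set of sequence breakers is applied at every level, whereas the text allows a separate sub-list at each level of recursion.

```python
def event_times(cae, primitives, breakers=frozenset()):
    """Return the times at which `cae` is recognised. A primitive is a
    string naming a PAE; a compound is a (head, tail) pair, recognised
    at each tail time that follows a head with no breaker in between."""
    if isinstance(cae, str):
        return sorted(primitives.get(cae, []))
    head, tail = cae
    head_times = event_times(head, primitives, breakers)
    tail_times = event_times(tail, primitives, breakers)
    breaker_times = [t for name in breakers for t in primitives.get(name, [])]
    found = []
    for tt in tail_times:
        earlier = [h for h in head_times if h < tt]
        if not earlier:
            continue
        h = earlier[-1]  # the most recent head before this tail
        if not any(h < b < tt for b in breaker_times):
            found.append(tt)
    return found

def bit_pattern(caes, primitives, breakers=frozenset()):
    """Output form (1): one bit per event, set to 1 if it was detected."""
    return [1 if event_times(c, primitives, breakers) else 0 for c in caes]

# Hypothetical pulse times for the primitives of the example in the text.
primitives = {"A": [1], "B": [2, 6], "C": [3], "F": [4]}
cae = ((("A", ("B", "C")), "F"), "B")  # stands for (((A(BC))F)B)
print(event_times(cae, primitives), bit_pattern([cae, ("C", "A")], primitives))
```

Each call level corresponds to one Elementary Sequence Element: two event inputs whose precedence is tested, plus the breakers that veto the match.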

Figure 3 illustrates a set of primitive events, the compounds derived
assuming no other event is allowed to intervene (thus all events may be
said to be 'sequence breakers'), and the bit patterns derived according
to output method (1).

Thus the original set of time-varying analogue signals may, in the
manner suggested, and as illustrated with reference to a computer based
grammar, be translated into a set of non-ordered, binary features,
using output form (1). Decision logic 400 is organised on the simple
basis of bit-for-bit matching with patterns stored as plugs in a three
layer matrix board. These allow presence, absence, or don't care
conditions to be specified, the latter condition obtaining when a plug
corresponding to the feature for the word in question is omitted. The
whole of the apparatus is run by control circuitry 500.
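
The bit-for-bit matching with presence, absence and don't care conditions can be sketched as below. The plug board is modelled as a template per word, which is an assumption made for illustration; the names, the vocabulary and the bit values are hypothetical.

```python
def matches(bits, template):
    """Compare a bit pattern with a stored template. Template entries:
    1 = feature must be present, 0 = feature must be absent,
    None = don't care (no plug fitted for that feature)."""
    return all(want is None or want == bit
               for bit, want in zip(bits, template))

def recognise(bits, templates):
    """Return every word whose template agrees with the bit pattern."""
    return [word for word, tpl in templates.items() if matches(bits, tpl)]

# Hypothetical templates for a two-word vocabulary (illustrative only).
templates = {
    "six": [1, None, 0, 1],
    "two": [0, 1, None, 1],
}
print(recognise([1, 0, 0, 1], templates))
```

An empty result corresponds to a pattern that matches no stored word, and more than one result to templates that are not mutually exclusive.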

The various sections of the arrangement of Fig. 1 are now described
individually in greater detail.

The speech input from microphone m, Fig. 4, is first passed through a
pre-amplifier stage 101 and a logarithmic compression stage 102. The
compressed speech signal is then fed to three classifying circuits. Two
of these are respectively a high-pass filter 103 and a low-pass filter
104. The third section is a total energy detector 105. The high and
low-pass filters each include rectification and smoothing in their
outputs. The total energy detector includes rectification and smoothing
followed by a trigger circuit which comes on when the rectified and
smoothed output of the total energy rises above a set threshold (which
may be variable by manual control) and goes off when the same output
falls below a second, lower threshold.

The outputs of the filters 103 and 104 are fed to a balance circuit 106
which determines the ratio of high to low frequency energy. The balance
circuit incorporates a trigger so arranged that the high frequency
output is not inhibited when the ratio of high to low frequency energy
exceeds a first threshold and is inhibited when this ratio falls below
a second, lower threshold. A similar trigger arrangement caters for the
opposite condition when the low to high frequency energy ratio exceeds
a first threshold or falls below a second, lower threshold. If there is
a balance between the high and low frequency energy content a third
output is generated indicating 'both'. The outputs of the balance
circuit 106 and the total energy detector 105 may be deemed the
Primitive Acoustic Characteristics of the speech.

The PACs are fed to feature time-continuity filters FTCF1 .... FTCF4.
The first two are concerned with high and low frequency PACs
respectively. FTCF3 is concerned with the 'both' output from the
balance circuit and FTCF4 with the total energy of the signal. The
total energy output is applied to FTCF4 via gate 1 which delivers a '1'
when there is silence. This '1' is also used to inhibit the output of
gate 3 which carries the 'both' output to FTCF3. The 'both' output is
first applied to gate 2 to deliver a '0' when the balance ratio is 1:1.
The NOR gate 3 will only deliver an output when gate 1 delivers a '0',
indicating that speech energy is present, in conjunction with the '0'
from gate 2. The PAC inputs to the feature time-continuity filters may
be described as 'hiss?', 'humph?', 'both?' and 'gap?' respectively.

The purpose of the feature time-continuity filter is to produce a
signal only when the input has been present continuously for a preset
(manually variable) time, and to stop giving an output when the input
has been continuously absent for a preset period of time. The feature
time-continuity filter, Fig. 5, comprises a pair of integrating
one-shot multivibrators 107, 108 each of which delivers a positive
going output pulse which lasts for a time after the last positive going
edge input. The input, e.g. a positive pulse T_0 to T_x, is applied to
multivibrator 107. The positive going leading edge at time T_0 triggers
integrating one-shot 107 which delivers a positive pulse T_0 to T_1.
The input is also inverted in gate 16 and applied, together with the
output from 107, to the NOR gate 17. The output from 107 will inhibit
the output of gate 17 until time T_1, so gate 17 delivers a positive
going pulse T_1 to T_x. This is inverted in gate 18 and forms the input
to the second one-shot multivibrator 108, which is therefore triggered
by the positive going back edge at time T_x to produce a positive
output pulse T_x to T_2. This output is applied to the NOR gate 19
together with the output from gate 17 to produce a negative going pulse
T_1 to T_2. This pulse T_1 to T_2 is inverted by gate 20 whose output
effectively constitutes a Primitive Acoustic Feature but its generation
inevitably involves a delay with respect to the input. Each FTCF
therefore also includes a delay normalisation arrangement, adjusted to
make the overall delay equal to a convenient standard so that all the
PAF's are presented simultaneously. The basic PAF output from gate 20
is applied to monostable 109, and the inverted signal from the
preceding gate is applied to monostable 110. Monostable 109, being
triggered by the positive going leading edge of the PAF input, produces
a pulse T_1 to T_3 while 110, being triggered by the positive going
back edge of the inverted PAF, produces a pulse T_2 to T_4. The outputs
from these monostables are inverted by gates 21, 22 and are applied to
a flip-flop 111 which is effectively set at time T_3 and re-set at time
T_4.

The output T_3 to T_4 from the flip-flop 111 is thus a PAF
incorporating 'time hysteresis' and normalised delay.

The four PAF outputs, Fig. 4, are applied to ternary event detectors
TED1 .... TED4 to turn the PAF's into PAE's.

The operation is clear from Figure 6 and the associated waveform
diagram, Figure 7. A standard pulse, of duration t microseconds, is
produced by monostable 112 followed by a drive circuit 113, for an
input pulse of any length. If the input pulse ends before a variable
monostable 114, period t_b - d/2, stops firing then gate 30 never has
two simultaneous '0' inputs, and therefore gives no output. The output
of gate 23 will be at '0' when the input ends, giving gate 24 two
simultaneous '0' inputs, until the monostable 114 ceases firing. A '1'
pulse is therefore produced at the output of gate 24 and, the output of
gate 27 being normally '0', the consequent '0' pulse from gate 25 and
'1' pulse from gate 26 lead to the output of a standard pulse, width t,
from monostable 115 and drive circuit 116.

If the input ends during the period of ambiguity (determined by the
setting both of the first and of the second variable monostables 114,
117) it will have continued past the end of the firing period of the
first variable monostable 114. Gate 30 will, therefore, have received
two simultaneous '0' inputs, giving rise to a '1' pulse at its output,
starting at the end of the period of first variable monostable 114, and
finishing at the end of the input signal. This '1' pulse is inverted by
gate 31, and the trailing edge thus triggers the fixed monostable 118
leading to the production of a standard pulse marking the end of the
input at the output of drive circuit 119. At the same time the leading
edge of the pulse from gate 30 triggers the second variable monostable
117, period d, and for this period of time a '0' is present at the
output of gate 98. Thus if the input stops before the expiration of d
(i.e. the input stops during the period designated as ambiguous), gate
27 will have two simultaneous '0' inputs, and a '1' pulse will appear
on the output, starting at the end of the input, and ending at the
expiration of d. This '1' pulse, acting on gate 25, produces a '0'
pulse at the input to gate 26 and hence a standard pulse from the
output of monostable 115 marking the end of the input (since the output
for gate 24 has remained '0'). Thus, an input pulse which is
ambiguously close in duration to the nominal duration t_b will produce
a standard pulse both from monostable 115 and monostable 118, as
required.

Finally, if the input ends after the second 
monostable 117 has ceased firing, then only 
the standard pulse from monostable 118 will 
be produced. 

It is seen, therefore, that monostable 112 produces a pulse each time
the input PAF starts; monostable 115 produces a pulse if the input PAF
lasts less than t_b; monostable 118 produces a pulse if the input PAF
lasts longer than t_b; and pulses appear simultaneously from
monostables 115 and 118 if the input PAF duration is ambiguously close
to t_b. In this manner PAF's are transformed into PAE's.

To return to Figure 4. A freeze level is brought into gates 7, 8, 9,
10, 11, 12 to inhibit the production of PAFs when the machine is frozen
(and hence in the output cycle). For the 'gap' channel, we cannot
inhibit at the PAF level, because silence will be present at the time
of freezing, and a spurious 'end of long gap' and subsequently
'beginning of gap' will be produced. Therefore freezing at this level
is effected at the PAE, with the three TED4 outputs being inverted in
gates 13, 14, 15 and then applied to gates 10, 11, 12 together with the
freeze level.

It is convenient to consider the controller next. Whenever a 'beginning
of gap' signal occurs, i.e. from gate 10, Fig. 4, the end-of-word
integrating one-shot 501, Fig. 8, starts timing.

If no PAE from gate 11 occurs between the last PAE from gate 10 and the
expiration of the period of the integrating one-shot, then gate 36
receives two simultaneous '0' inputs and the output goes to '1'
starting at the instant that the monostable period expires. This
triggers the display monostable 502, and the leading edge of the output
sets the control bistable 503, which, in turn, sets the freeze level
via the freeze drive 504 to '1', freezing the machine.

Note that the PAE line from gate 12 is
taken to the start bistable 505. The first PAE
produced for any word, if the beginning of
the word is not missed, must be a PAE from
gate 12, 'end of long gap'. If this is not so,
either too much noise preceded the word, or
the speaker started speaking before the
machine 'unfroze' from the last operation.
Thus, if the start bistable is still in the reset
condition when the machine freezes, some
sort of error has occurred. The start bistable
levels are therefore used as follows. In the
reset condition the start bistable inhibits the
computing indicator drive and the output level
drive, and allows the ready or error indicator
to be driven, depending on the state of the
control bistable 503. In the set condition it
inhibits the error and ready indicator drives
and, depending on the state of the control
bistable, allows the computing indicator or
the output level and indicator to be driven.
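
The indicator logic described in this paragraph amounts to a
two-variable decode of the control and start bistables. The Python
sketch below sets it out as a truth table. The function and argument
names are hypothetical, and no assignment of individual gate numbers
to individual indicators is implied.

```python
def indicators(control_set, start_set):
    """Decode the two bistable states into the four indicator signals.

    control_set: True when the control bistable 503 is set (frozen).
    start_set:   True when the start bistable 505 is set (valid start).
    """
    return {
        "ready": not control_set and not start_set,
        "error": control_set and not start_set,
        "computing": not control_set and start_set,
        "output": control_set and start_set,
    }
```

Exactly one indicator is active for each combination of states; for
example, a freeze with the start bistable still reset gives 'error'.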

To return to the freezing of the machine:
when the machine is frozen, depending on
whether or not a valid start was obtained,
either the output level will also appear, or an
error indication will be made and the output
suppressed. The machine stays frozen in the
output cycle until the display monostable
period expires. Gate 37 inverts the output of
the display monostable so that the trailing
edge fires the reset monostable 506. If the
switch following 506 is set to 'auto', the
output of the reset monostable produces a
reset level via the reset drive 507 which
clears the control and start bistables 503,
505. It is also taken to other parts of the
machine to clear the memory and output
stores of the sequence detector. Thus the
reset level puts the machine in the ready
state, cleared for action, and 'unfrozen'.

The switch is provided to inhibit resetting,
and to allow manual resetting, if desired. The
outputs of 503 and 505 are gated in gates 32,
33, 34 and 35 to obtain the required 'ready',
'error', 'computing' and 'output' indicator
signals. The output signal is derived via an
output drive circuit 508.

The sequence detector 200 uses ESE's
(Elementary Sequence Elements) to carry out
first order Sequence Detection on the basis
of selected PAE's. Each of the ESE's has two
main inputs, and two auxiliary inputs. The
operation may be described with reference to
Figure 10.

The purpose of the ESE is to produce a
standard pulse out when the two main inputs
are sequentially activated. If we call the two
main inputs i and j, then one output gives a
pulse when j occurs, following i, and the
other gives a pulse out when i occurs,
following j. We may designate these pulses
ij and ji. They are standard pulses, of
duration t microseconds. If the two main
inputs overlap, or if the input labelled S/B,
for Sequence Breaker, is activated between
the occurrence of one main input and the
other, then no output occurs; the device is,
instead, reset appropriately. The occurrence
of either main input is 'remembered', if either
persists, by itself, after the other inputs have
stopped. The device is symmetrical with
respect to the two inputs and outputs, so the
operation of half will now be detailed.

If a pulse appears at i, and no other
input, then the bistable, comprised of gates 40
and 41, is set, and a '0' appears on the top
input to gate 42. If a pulse then appears at
j, and no other input, the output of gate 44
falls to zero, during the pulse, and a '1' pulse
is therefore produced from gate 42, which
momentarily has four simultaneous '0' inputs,
presuming the other two inputs are at '0'.
This '1' pulse causes the monostable 202 to
fire, and an output pulse is produced via
drive circuit 203. The output signal also is
fed back to clear the memory bistable, the
additional connection to gate 39 ensuring that
there is no ambiguity in the resetting
operation due to i becoming active again. The
device is then ready to register another
'Elementary Sequence'.

Gate 43 produces a '1' output if both i and
j are present at the same time. This prevents
either output being activated by either
input/bistable combination by inhibiting gate
42, and also resets the memory bistables,
ambiguity being prevented by the
cross-connections from i to gate 45 and j to
gate 39. Whichever input lasts longest will
eventually be remembered, as is appropriate.
If a Sequence Breaker occurs, then, again,
the memory bistables are positively reset, and
the output is inhibited by the connections to
gates 42 and 48. Thus a Sequence Breaker
occurring in the middle of an Elementary
Sequence does 'break the sequence'.
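
The behaviour of one ESE, as described above, can be summarised by a
small state machine. The Python sketch below is a simplified
behavioural model, not a gate-level rendering of Figure 10. It
assumes the inputs are sampled at discrete instants, that an output
pulse can be represented by a label returned for one sample, and that
the input which completes a sequence is itself remembered afterwards.
The class and method names are hypothetical.

```python
class ESE:
    """Behavioural model of an Elementary Sequence Element."""

    def __init__(self):
        self.mem_i = False  # i has occurred by itself and is remembered
        self.mem_j = False  # j has occurred by itself and is remembered

    def step(self, i, j, sb=False, rs=False):
        """Process one sample; return 'ij', 'ji' or None."""
        if rs or sb or (i and j):
            # Reset, Sequence Breaker or overlapping inputs:
            # clear both memories and inhibit any output.
            self.mem_i = self.mem_j = False
            return None
        out = None
        if j:
            if self.mem_i:          # j following i
                out = "ij"
                self.mem_i = False  # output clears the memory
            self.mem_j = True       # assumption: j alone is remembered
        elif i:
            if self.mem_j:          # i following j
                out = "ji"
                self.mem_j = False
            self.mem_i = True       # assumption: i alone is remembered
        return out


def run(samples):
    """Feed a list of (i, j[, sb[, rs]]) samples through a fresh ESE."""
    ese = ESE()
    outputs = []
    for sample in samples:
        result = ese.step(*sample)
        if result is not None:
            outputs.append(result)
    return outputs
```

Under these assumptions, i followed by j yields a single 'ij' pulse,
whereas the same inputs with a Sequence Breaker between them, or with
i and j overlapping, yield no pulse at all.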

The input marked R/S, for Reset, is
activated by the reset level generated by the
controller, and simply clears the memory
bistables ready for another operation. There
is no problem of conflict with other signals,
since the machine is frozen at the instant of
resetting, though it could provide additional
protection to insert a slight delay in the
resetting of the control bistable, to make sure
the machine is positively reset before it is
'unfrozen'. The final outputs of some if not
all the ESE's are entered directly into the
Bit Pattern Register 300, Fig. 9, also shown
in Fig. 1.

The final section of the machine, the
Decision taker 400, Fig. 1, is straightforward
gating logic as shown in Fig. 11. The matrix
may be, in practice, a three layer plug-board.
The strips in one layer of such a plug-board
matrix may be shorted to the strips in either
of the other two layers, which strips are
arranged at right-angles to the strips of the
first layer. By putting in suitable plugs a
pattern of input states may be selected for
each desired output, so that the output only
comes on when the specified inputs are in
the specified states, i.e. combinations of '1'
and '0'. Lamps and drivers are provided to
allow the operation to be monitored, and to
allow the matrix rows and outputs to be
driven.

The decision taker is thus a straight pattern
matching arrangement.
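
The pattern matching performed by the decision taker may be
illustrated in Python. The sketch assumes the Bit Pattern Register is
available as a list of 0/1 values and that each output's plugs can be
represented as a mapping from bit position to required state, with
unplugged positions ignored. The function name, the data layout and
the output labels are hypothetical.

```python
def decide(register_bits, plugboard):
    """Return the outputs whose plugged pattern matches the register.

    register_bits: list of 0/1 values read from the bit pattern register.
    plugboard: dict mapping an output name to {bit position: required
               state}; positions without a plug are not tested.
    """
    matches = []
    for output_name, required in plugboard.items():
        if all(register_bits[pos] == state
               for pos, state in required.items()):
            matches.append(output_name)
    return matches


# Hypothetical plugging: two outputs over a four-bit register.
EXAMPLE_PLUGBOARD = {
    "word A": {0: 1, 2: 0},
    "word B": {0: 0, 1: 1},
}
```

With this plugging, the register 1, 0, 0, 1 brings on 'word A' only,
and a register of all ones brings on neither output.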

WHAT WE CLAIM IS: — 

1. Speech recognition apparatus including
means responsive to selected acoustic
characteristics for decomposing a signal
representing an acoustic input into analogue
signals on parallel channels, each analogue
signal being representative of a different
acoustic feature of the input, means for
transforming the analogue signals into binary
signals on parallel channels, the binary signals
constituting time ordered event markers
relating to the occurrence or occurrences of
the respective acoustic features represented
by the analogue signals, means for generating
further binary signals being time ordered
event markers marking the occurrence or
occurrences of specified sequences of event
markers relating to the occurrences of two or
more different features in specified sequences,
and means for storing in a fixed predetermined
sequence binary information representing both
the content of the acoustic input in terms of
the different individual acoustic features of
the input and the content of the acoustic input
in terms of the specified sequences of acoustic
features.

2. Apparatus according to claim 1 in which
the means for decomposing the signal
representing an acoustic input includes a
plurality of filters each arranged to pass a
different range of frequencies and means for
producing from the filters a plurality of
outputs each indicating the relative amplitudes
of one of the filter outputs with respect to
another filter output.

3. Apparatus according to claim 1 or 2
including means for detecting the total energy
in the input and means for producing from
the total energy detector an output indicating
that the total energy exceeds a predetermined
threshold level.

4. Apparatus according to claim 3 as
appended to claim 2 in which the means for
producing the outputs indicative of the
relative amplitudes of the filter outputs each
include trigger means having a first threshold
level whereby the output is not inhibited
when the ratio of one filter output amplitude
to another filter output amplitude exceeds
the first threshold and a second lower
threshold level whereby the output is inhibited
when the ratio falls below the second
threshold level.

5. Apparatus according to claim 4 including
means for producing an output when
there is a balance between two filter outputs.

6. Apparatus according to claim 5 including
means for inhibiting the balance output
when the total energy does not exceed the
predetermined threshold level.

7. Apparatus according to claim 4, 5 or
6 including a plurality of pulse generators
to each of which is applied one of the outputs
derived from the filters and the total energy
detector, each pulse generator being arranged
to produce an output pulse the start of which
occurs only when the input to the pulse
generator has been present continuously for
a predetermined period of time and the end
of which occurs only when the input has been
absent continuously for a predetermined
period of time.

8. Apparatus according to claim 7 wherein
each pulse generator includes a pair of
one-shot multivibrators each of which delivers
an output pulse which lasts for a predetermined
period of time after an input has been
applied, means for triggering one of the
multivibrators from the leading edge of an
input signal derived from a filter or total
energy detector, means for triggering the
other multivibrator from the trailing edge of
the input signal, and gating means for gating
the input signal with the pulses generated by
the two multivibrators whereby the gated
input signal forms the output of the pulse
generator.

9. Apparatus according to claim 8 including
means for normalising the delays occurring
in the outputs of a plurality of pulse
generators.

10. Apparatus according to claim 8 or 9
including a plurality of means for generating
binary information signals each producing an
output according to the significance and
duration of the output of one of the pulse
generators.

11. Apparatus according to claim 10
including a plurality of gating logic means each
of which is responsive to two or more binary
input signals whereby the relative sequential
occurrence of those signals can be determined
and means for generating a binary output
signal according to the relative sequential
occurrence of the binary input signals.

12. Apparatus according to claim 11 wherein
the binary input signals for some of the
gating logic means are those derived from
the pulse generators and the binary output
signals of some of the gating logic means
form the binary input signals to other gating
logic means.

13. Apparatus according to claim 11 or 12
including means for storing in a predetermined
sequence the binary output signals
from some of the gating logic means, and
means for comparing the stored information
pattern with predetermined binary information
patterns.

14. Apparatus according to claim 13
wherein the means for storing the binary
output signals includes one or more monostables.

15. Apparatus according to claim 13 or
14 including means for determining the
likelihood ratio of occurrence to non-occurrence
of the constituents of the stored binary
information in comparison with a reference
pattern and means responsive to said ratio
whereby a decision is made for accepting,
rejecting or requesting a repeat of the acoustic
input.

16. Apparatus according to any one of the
preceding claims 10 to 15 including means
for freezing the operation of or output from
each of the plurality of means for generating
binary information signals indicating the
significance and duration of outputs of the
pulse generators.

17. Speech recognition apparatus substan- 
tially as described with reference to the 
accompanying drawings. 

G. H. EDMUNDS,
Chartered Patent Agent,
For the Applicants.